The purpose of the text classification is to build a classifier that labels a tweet as 'pro-protester' or 'pro-police'. These labels will be used to calculate the percentage of 'pro-protester' tweets for specific users. That percentage will then be used as a feature in a Random Forest classifier that attempts to predict whether a user will remain engaged in the Ferguson conversation on Twitter.
Resource and code help: Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins
In [3]:
#imports and query mongodb
import json
import pymongo
from bson import json_util # From pymongo
import numpy as np
import pandas as pd
from datetime import datetime
from nltk.tokenize import word_tokenize
mdb = pymongo.MongoClient('mongodb://10.208.160.157')
db = mdb.ferguson
tweets = db.tweets
aug_tweets = db.tweets_aug
Specific users were chosen to train the classifier, based on the content of their tweets. To be chosen for the training set, the majority of a user's Ferguson-related content needed to be clearly 'pro-protester' or 'pro-police', and the user's tweets could not consist primarily of retweets.
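The users were selected by reading their timelines. Purely as an illustration of the "not primarily retweets" criterion, the sketch below computes the share of a user's tweets that look like retweets. The helper name `retweet_share`, the sample texts, and the assumption that a retweet starts with "RT @" are all hypothetical and are not part of the original selection process.

```python
def retweet_share(texts):
    """Return the fraction of tweets that look like retweets (start with 'RT @')."""
    if not texts:
        return 0.0
    retweets = sum(1 for t in texts if t.startswith("RT @"))
    return retweets / float(len(texts))

# Made-up sample timeline
sample = [
    "RT @someone: example headline",
    "my own comment on the news",
    "another original tweet",
    "RT @other: second retweet",
]
share = retweet_share(sample)
# A user passes this rough check when no more than half of the tweets are retweets
mostly_original = share <= 0.5
```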
In [4]:
#queries the database for the user's tweets and creates a dataframe for the user
def getTweetsAndLabel(name, label):
    tweet_fields = ['id', '_iso_created_at', 'text' ]
    #note: 'fields' is the PyMongo 2.x keyword; PyMongo 3+ renamed it to 'projection'
    user = tweets.find({"user.screen_name": name}, fields = tweet_fields).sort([("_iso_created_at", -1 )])
    df = pd.DataFrame(list(user), columns = tweet_fields)
    df['label'] = label
    return df
A dataframe is generated with the tweets of each user, and each tweet is given a label based on whether that user is 'pro-protester' or 'pro-police'.
Dataframes for pro-protester users: 'p'
In [5]:
deray_df = getTweetsAndLabel('deray', 'p')
In [6]:
big6domino_df = getTweetsAndLabel('Big6domino', 'p')
In [7]:
wesleylowery_df = getTweetsAndLabel('WesleyLowery', 'p')
In [8]:
plussone_df = getTweetsAndLabel('plussone', 'p')
In [9]:
jsanbower_df = getTweetsAndLabel('JSanbower', 'p')
In [10]:
est_laced_up_df = getTweetsAndLabel('EST_Laced_Up', 'p')
In [11]:
lisaarmstrong_df = getTweetsAndLabel('LisaArmstrong', 'p')
In [12]:
judsontwit_df = getTweetsAndLabel('judsontwit', 'p')
In [13]:
#combine all pro-protester user dataframes
prot = [deray_df, big6domino_df, wesleylowery_df, plussone_df, est_laced_up_df, judsontwit_df, jsanbower_df, lisaarmstrong_df]
prot_df = pd.concat(prot)
len(prot_df)
Out[13]:
Dataframes for pro-police users: 'c'
In [14]:
hotnostrilsrfun_df = getTweetsAndLabel('HotNostrilsrFun', 'c')
In [15]:
rburg63_df = getTweetsAndLabel('rburg63', 'c')
In [16]:
msully65_df = getTweetsAndLabel('msully65', 'c')
In [17]:
anna12061_df = getTweetsAndLabel('anna12061', 'c')
In [18]:
scrufey21_df = getTweetsAndLabel('Scrufey21', 'c')
In [19]:
jake_bradford88_df = getTweetsAndLabel('jake_bradford88', 'c')
In [20]:
johncardillo_df = getTweetsAndLabel('johncardillo', 'c')
In [21]:
#combine all pro-police dataframes
pol = [hotnostrilsrfun_df, rburg63_df, msully65_df, anna12061_df, scrufey21_df, jake_bradford88_df, johncardillo_df]
pol_df = pd.concat(pol)
len(pol_df)
Out[21]:
In [22]:
#combine pro-protester and pro-police dataframes
df_all = pd.concat([prot_df, pol_df]).reset_index(drop = True)
len(df_all)
Out[22]:
Check the balance of the data set.
In [28]:
print "Pro-protester tweets make up {0:.0f}% of the labeled data set.".format((len(prot_df)*1.0 / len(df_all))*100)
print "Pro-police tweets make up {0:.0f}% of the labeled data set.".format((len(pol_df)*1.0 / len(df_all))*100)
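The text was prepared for training using the following steps:
1. Tokenize each tweet with NLTK's word_tokenize and convert the tokens to lowercase.
2. Remove NLTK's English stop words, the custom stop words ('http', 'https', '...', 'amp'), tokens of two characters or fewer, and tokens that are not purely alphabetic.
3. Find the ten highest-scoring bigrams in the tweet (chi-square measure) and add them to the word list.
4. Convert the word list to the {feature: True} dictionary format expected by the classifier.
Note that many of these steps could be optimized further. For instance, more or fewer bigrams could be used, more stop words could be added, and the analysis could be restricted to a fixed number of the most important words. However, the accuracy of the classifier was high enough that further optimization of the text processing was not necessary.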
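To make the feature format concrete, here is a minimal, dependency-free sketch of what the preparation produces for a single tweet. It is a simplification rather than the notebook's code: the regex tokenizer, the tiny `STOP_WORDS` set, and taking adjacent word pairs instead of chi-square ranked bigrams are all stand-ins, and the name `tweet_to_features` and the sample tweet are hypothetical.

```python
import re

# Tiny stand-in stop word list; the notebook uses NLTK's full English list
STOP_WORDS = {"the", "and", "are", "for", "with", "this", "that"}
CUSTOM_STOPS = {"http", "https", "amp"}

def tweet_to_features(text, n_bigrams=10):
    # Simplified tokenizer: runs of letters only, lowercased
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    # Drop stop words and tokens of two characters or fewer
    words = [w for w in words
             if len(w) > 2 and w not in STOP_WORDS and w not in CUSTOM_STOPS]
    # Adjacent word pairs, capped at n_bigrams (no association scoring here)
    bigrams = list(zip(words, words[1:]))[:n_bigrams]
    # Presence features: every word and bigram maps to True
    return {feature: True for feature in words + bigrams}

features = tweet_to_features(
    "Police and protesters are in the streets of Ferguson http://t.co/x")
```

The resulting dictionary has single words and word-pair tuples as keys, which is the shape of input that NLTK classifiers accept as a feature set.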
In [30]:
#prepare text using steps described above
def prepText(df):
    #NLTK imports
    from nltk.corpus import stopwords
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures
    #get the NLTK english stop words
    english_stops = set(stopwords.words('english'))
    custom_stops = ['http', 'https', '...', 'amp']
    for i in range(0, len(df)):
        #tokenize and lowercase the tweet text
        words = word_tokenize(df.loc[i, 'text'])
        words = [w.lower() for w in words]
        #drop stop words, short tokens, and non-alphabetic tokens
        words = [word for word in words if word not in english_stops and len(word) > 2 and word not in custom_stops and word.isalpha()]
        #add the 10 highest-scoring bigrams (chi-square measure)
        bigram_finder = BigramCollocationFinder.from_words(words)
        bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 10)
        words = words + bigrams
        #put the word list into the correct format for the classifier
        words_dict = {word: True for word in words}
        #store the dict as a string; it is parsed back later with ast.literal_eval
        df.loc[i, 'train data'] = str(words_dict)
    return df
IPython parallel computing was used to make the text preparation operation faster.
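The parallel step follows a scatter, apply, gather pattern: the data is split into one piece per engine, each engine processes its piece, and the pieces are concatenated again. The sketch below mimics that flow serially with plain lists. The `scatter` and `gather` helpers are hypothetical illustrations, and the exact partitioning used by IPython's direct view may differ.

```python
def scatter(items, n_engines):
    """Split items into n_engines contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_engines)
    chunks, start = [], 0
    for k in range(n_engines):
        # The first 'extra' chunks receive one additional item
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def gather(chunks):
    """Concatenate per-engine results back into one list, preserving order."""
    return [item for chunk in chunks for item in chunk]

tweets_text = ["Tweet A", "Tweet B", "Tweet C", "Tweet D", "Tweet E"]
chunks = scatter(tweets_text, 2)
# Stand-in for running the preparation function on each engine
processed = [[t.lower() for t in chunk] for chunk in chunks]
result = gather(processed)
```

Because each piece is processed independently, any code that loops over positions 0 to len(piece) - 1 needs each piece to be indexed from zero, which is presumably why the notebook resets the index on every engine before calling prepText.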
In [38]:
#set up parallel computing and confirm number of engines
from IPython import parallel
#make sure to enter the correct profile
clients = parallel.Client(profile='nbserver')
# use synchronous computations - all results must finish computing before any results are recorded
clients.block = True
dview = clients.direct_view()
print clients.ids
In [39]:
%px from nltk.tokenize import word_tokenize
dview.push((dict(prepText = prepText)))
dview.scatter('df_all', df_all)
dview.execute('df_all.reset_index(drop = True, inplace = True)')
dview.execute('df_all = prepText(df_all)')
all_list = dview.gather('df_all')
df_all = pd.concat(all_list).reset_index(drop = True)
print len(df_all)
df_all.head()
Out[39]:
In [40]:
#prepare train and test sets
import ast
import collections
from nltk import metrics
def prepTrainText(df):
    #sklearn.cross_validation was removed in scikit-learn 0.20;
    #newer versions provide train_test_split in sklearn.model_selection
    from sklearn.cross_validation import train_test_split
    #build the list inside the function so repeated calls do not accumulate duplicates
    data_list = []
    #convert strings in dataframe to dicts and pair each dict with its label
    for i in range(0, len(df)):
        data_list.append((ast.literal_eval(df.loc[i, 'train data']), df.loc[i, 'label']))
    #hold out 33% of the labeled tweets for testing
    train_data, test_data = train_test_split(data_list, test_size = 0.33, random_state = 42)
    return train_data, test_data
In [41]:
train_set, test_set = prepTrainText(df_all)
In [42]:
print "The length of the training set is %d." % len(train_set)
print "The length of the test set is %d." % len(test_set)
Since this is a text classification problem, we chose to train an NLTK Naive Bayes classifier. Naive Bayes is suggested as a good starting point in the Python 3 Text Processing with NLTK 3 Cookbook.
In addition, the scikit-learn algorithm cheat sheet (http://scikit-learn.org/stable/tutorial/machine_learning_map/) shows Naive Bayes as a good choice for labeled text data.
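For readers unfamiliar with the method, the following is a minimal sketch of how a Naive Bayes classifier scores presence features like the ones built above. It is a deliberate simplification and not NLTK's implementation: it uses add-one smoothing and only looks at features present in the tweet, whereas NLTK's classifier uses its own probability estimator. The function names and the toy training data are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_feats):
    """Count how often each label and each (label, feature) pair occurs."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    for feats, label in labeled_feats:
        label_counts[label] += 1
        for name in feats:
            feat_counts[label][name] += 1
    return label_counts, feat_counts

def classify_nb(model, feats):
    """Return the label with the highest log prior plus log likelihoods."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / float(total))  # log prior
        for name in feats:
            # Add-one smoothed estimate of P(feature present | label)
            score += math.log((feat_counts[label][name] + 1.0) / (n + 2.0))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data in the same (feature dict, label) format
toy_train = [
    ({"justice": True, "peaceful": True}, "p"),
    ({"protest": True, "justice": True}, "p"),
    ({"officer": True, "looting": True}, "c"),
    ({"looting": True, "riot": True}, "c"),
]
model = train_nb(toy_train)
prediction = classify_nb(model, {"justice": True, "march": True})
```

The "naive" part is the sum over features: each word contributes independently to the score, which keeps training and classification cheap even with a large vocabulary.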
In [33]:
#train classifier and show most informative features
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features()
In [31]:
#function to calculate precision and recall
def precAndRec(classifier, test_feats):
    #refsets: indices of test items by true label; testsets: indices by predicted label
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    precisions = {}
    recalls = {}
    for i, (feats, label) in enumerate(test_feats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)
    for label in classifier.labels():
        precisions[label] = metrics.precision(refsets[label], testsets[label])
        recalls[label] = metrics.recall(refsets[label], testsets[label])
    return precisions, recalls
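The precision and recall computed by precAndRec come from comparing, for each label, the set of test items that truly carry the label with the set of items the classifier assigned to it. The small worked example below reproduces that set arithmetic without NLTK. The `precision` and `recall` helpers are stand-ins written for this illustration, and the labels are made up.

```python
def precision(reference, test):
    """Share of items predicted as the label that truly have the label."""
    if not test:
        return None
    return len(reference & test) / float(len(test))

def recall(reference, test):
    """Share of items that truly have the label that were predicted as such."""
    if not reference:
        return None
    return len(reference & test) / float(len(reference))

true_labels = ["p", "p", "p", "c", "c", "c"]
predicted   = ["p", "p", "c", "c", "p", "p"]

# Index sets for label 'p', built the same way as refsets and testsets
ref_p = {i for i, label in enumerate(true_labels) if label == "p"}
test_p = {i for i, label in enumerate(predicted) if label == "p"}

prec_p = precision(ref_p, test_p)  # 2 correct out of 4 predicted 'p'
rec_p = recall(ref_p, test_p)      # 2 found out of 3 true 'p'
```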
In [34]:
#test classifier and output results
import nltk.classify.util
print 'Accuracy: %0.2f' % nltk.classify.util.accuracy(classifier, test_set)
prec, rec = precAndRec(classifier, test_set)
print "Precision for 'p': %0.2f" % prec['p']
print "Precision for 'c': %0.2f" % prec['c']
print "Recall for 'p': %0.2f" % rec['p']
print "Recall for 'c': %0.2f" % rec['c']
The accuracy, precision, and recall are all good. Therefore, no further work on text optimization will be done.
Scikit-learn also offers a Naive Bayes classifier, and NLTK provides a wrapper class for it. Scikit-learn's Naive Bayes classifier has a smaller memory footprint. With the amount of data we're working with, memory considerations are important, so we also tried this classifier.
In [35]:
#sklearn Naive Bayes (usually better than NLTK, but slower)
from nltk.classify import scikitlearn
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
sknb_classifier = SklearnClassifier(MultinomialNB())
sknb_classifier.train(train_set)
print nltk.classify.util.accuracy(sknb_classifier, test_set)
prec, rec = precAndRec(sknb_classifier, test_set)
print "Precision for 'p': %0.2f" % prec['p']
print "Precision for 'c': %0.2f" % prec['c']
print "Recall for 'p': %0.2f" % rec['p']
print "Recall for 'c': %0.2f" % rec['c']
The results are similar. Because of its smaller memory footprint, the sklearn Naive Bayes classifier will be used for the classifications.
To confirm that the classifier is classifying tweets properly, additional users were identified as clearly pro-protester or pro-police. Their tweets were classified, and the results are below.
Note: 'p' = 'pro-protester', 'c' = 'pro-police'
In [37]:
#determine percent of user's tweets that are pro-protester
def testUser(name, classifier):
    df = getTweetsAndLabel(name, '')
    df_ready = prepText(df)
    #classify each tweet and store the predicted label
    for i in range(0, len(df_ready)):
        df_ready.loc[i, 'label'] = classifier.classify(ast.literal_eval(df_ready.loc[i]['train data']))
    p = len(df_ready[df_ready['label'] == 'p'])
    c = len(df_ready[df_ready['label'] == 'c'])
    t = p + c
    perc_p = p*1.0 / t
    return perc_p
In [38]:
x = testUser('AntonioFrench', sknb_classifier)
print x
print "Expected: 'p'"
In [39]:
x = testUser('FredSanford13', sknb_classifier)
print x
print "Expected: 'c'"
In [40]:
x = testUser('CAC8438', sknb_classifier)
print x
print "Expected: 'c'"
In [41]:
x = testUser('OpFerguson', sknb_classifier)
print x
print "Expected: 'p'"
In [42]:
x = testUser('ryanjreilly', sknb_classifier)
print x
print "Expected: 'p'"
In [43]:
x = testUser('Nettaaaaaaaa', sknb_classifier)
print x
print "Expected: 'p'"
In [44]:
x = testUser('timjacobwise', sknb_classifier)
print x
print "Expected: 'p'"
In [45]:
x = testUser('1969WAR1971', sknb_classifier)
print x
print "Expected: 'c'"
In [46]:
x = testUser('rdipego', sknb_classifier)
print x
print "Expected: 'c'"
Each user was classified as expected. If 'p' was expected, the percent of pro-protester tweets should be above 0.5. If 'c' was expected, it should be below 0.5.
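The per-user score and the 0.5 cut-off can be written out as a short sketch. The helper names `pro_protester_share` and `leaning` are hypothetical, and the labels are made up. Treating a share of exactly 0.5 as 'p' is an assumption taken from the final summary cell of this notebook, which counts users with a share of 0.5 or more as pro-protester.

```python
def pro_protester_share(labels):
    """Fraction of a user's classified tweets that are labeled 'p'."""
    p = sum(1 for label in labels if label == "p")
    c = sum(1 for label in labels if label == "c")
    total = p + c
    if total == 0:
        return None  # no classified tweets, so no share can be computed
    return p / float(total)

def leaning(share, threshold=0.5):
    """Map a share to 'p' or 'c'; ties at the threshold count as 'p'."""
    if share is None:
        return None
    return "p" if share >= threshold else "c"

labels = ["p", "p", "c", "p"]
share = pro_protester_share(labels)
side = leaning(share)
```

Returning None for a user without classified tweets avoids the division by zero that the plain ratio would raise.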
Notes on the data used to train the text classifier:
In other parts of this project, restrictions are being made to the data that have not been made to the data set used to train the text classifier. These restrictions include only looking at tweets with the hashtag #Ferguson or #ferguson (versus containing the word F/ferguson, even if not as a hashtag), and restricting the data to tweets from August.
For the purpose of text classification, the context of the tweets is the most important factor. The presence of #Ferguson or #ferguson hashtags is not important, nor is the date of the tweets. The context of the tweets remains largely the same through the time period being analysed. The high accuracy, precision, and recall of the classifier confirm that these restrictions of the data are not necessary for text classification.
In [72]:
#get users from August data
df_aug = pd.read_csv('/home/data/august_reduced.csv', error_bad_lines = False)
In [76]:
#function to get tweets for each user using August data set
def getTweetsAndLabelAug(name, label):
    df_users_tweets = df_aug[df_aug['user.screen_name'] == name].reset_index(drop = True)
    df_users_tweets['label'] = label
    return df_users_tweets
In [77]:
#determine percent of user's tweets that are pro-protester, using the August data set
def testUserAug(name, classifier):
    df = getTweetsAndLabelAug(name, '')
    df_ready = prepText(df)
    #classify each tweet and store the predicted label
    for i in range(0, len(df_ready)):
        df_ready.loc[i, 'label'] = classifier.classify(ast.literal_eval(df_ready.loc[i]['train data']))
    p = len(df_ready[df_ready['label'] == 'p'])
    c = len(df_ready[df_ready['label'] == 'c'])
    t = p + c
    perc_p = p*1.0 / t
    return perc_p
A few additional users identified during EDA were tested as a sanity check.
In [78]:
x = testUserAug('gerfingerpoken2', sknb_classifier)
print x
print "Expected: 'c'"
In [79]:
x = testUserAug('carolynsbuddy', sknb_classifier)
print x
print "Expected: 'p'"
In [80]:
x = testUserAug('PhilDeCarolis', sknb_classifier)
print x
print "Expected: '?'"
@PhilDeCarolis is interesting. He appears to be a libertarian and supporter of Ron Paul. Most of his tweets regarding Ferguson are retweets of news headlines. He doesn't seem to outright support either the protesters or the police.
The tweets of 5,000 random users with ten or more tweets in August will be classified. The sample size of 5,000 was chosen based on time and server memory constraints. The set of tweets from 5,000 users took about 2 to 2.5 hours to classify, and would occasionally cause the IPython kernel to crash. Larger servers were tried, and 5,000 was the most that could be classified in a reasonable amount of time without kernel crashes on the largest server available to the team.
IPython's parallel computing would have been nice to use for this lengthy process. However, when we attempted to scatter the work to the available cores, the process was stopped by memory errors. This may have been caused by the amount of data that needed to be pushed to each core to run the process.
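The user selection described above (users with ten or more August tweets, then a random sample) can be sketched with the standard library alone. This is an illustration on made-up screen names, not the notebook's pandas code. The helper name `sample_active_users` and its parameters are hypothetical, and unlike sampling without replacement in NumPy, this sketch caps the sample at the number of eligible users instead of raising an error.

```python
import random
from collections import Counter

def sample_active_users(screen_names, min_tweets=10, sample_size=5000, seed=None):
    """Pick up to sample_size users who have at least min_tweets tweets."""
    counts = Counter(screen_names)  # tweets per user
    eligible = sorted(name for name, n in counts.items() if n >= min_tweets)
    rng = random.Random(seed)
    k = min(sample_size, len(eligible))
    return rng.sample(eligible, k)  # sampling without replacement

# One entry per tweet, as in a column of screen names
names = ["user_a"] * 12 + ["user_b"] * 3 + ["user_c"] * 10 + ["user_d"] * 25
chosen = sample_active_users(names, min_tweets=10, sample_size=2, seed=0)
```

Passing a fixed seed makes the sample reproducible, which helps when a long classification run has to be restarted after a kernel crash.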
In [81]:
#Generate dataframe with users from August
users = pd.DataFrame({'count' : df_aug.groupby( [ "user.screen_name"] ).size()}).reset_index()
In [82]:
#restrict to users with at least 10 tweets, and randomly select 5000
users = users[pd.notnull(users['count'])]
users = users[users['count'] >= 10]
rows = np.random.choice(users.index.values, 5000, replace=False)
ran_users = users.ix[rows].reset_index(drop = True)
In [83]:
print len(ran_users)
ran_users.head()
Out[83]:
In [84]:
#function to classify tweets and calculate percent per user
def label_all(df):
    for i in range(0, len(df)):
        #share of the user's August tweets classified as pro-protester
        x = testUserAug(df.loc[i]['user.screen_name'], sknb_classifier)
        df.loc[i, 'perc_p'] = x
    return df
In [85]:
#classify tweets for users
final = label_all(ran_users)
In [86]:
final.head()
Out[86]:
In [87]:
len(final)
Out[87]:
In [93]:
#save dataframe for use in another notebook
final.to_pickle('final_aug_percents.pkl')
In [52]:
#calculate number of users that are pro-protester and number that are pro-police
p = len(final[final['perc_p'] >= 0.5])
c = len(final[final['perc_p'] < 0.5])
print "Percent of users that are pro-protester: {0:.0f}%".format(p*1.0/(p+c)*100)
print "Percent of users that are pro-police: {0:.0f}%".format(c*1.0/(c+p)*100)